Unit 1 homework sample solutions
DKU Stats 101 Spring 2025 Session 3
Introduction
Question 1: Describing your data (10 points)
1a. Where is this data from?
For this dataset, describe the data according to the five Ws & how defined in the textbook Chapter 1.2. What are some possible problems with the who and what of the dataset?
The original dataset can be found here.
- Who: paintings that have been auctioned
- What: price of paintings sold, other physical characteristics of the paintings
- When: scraped from a website in 2018; years sold vary
- Where: various art auction houses
- Why: to analyze the features of a painting that predict the price
- How: scraped by GitHub user ahmedhosny and cleaned up by GitHub user jasonshi10; data scraped from the website artsalesindex
Possible problems:
It is not clear from the GitHub documentation which cases were scraped, so there could be selection bias both in what was scraped (maybe only some types of cases were scraped) and in the selection of auction houses the art website monitors. For the what, it is also not clear how some of the features of the paintings were calculated; presumably this was done by some type of machine learning algorithm.
1b. What are the variable types?
For the following variables, please list the variable type as defined in the textbook Chapter 1.3:
- artist: either identifier or categorical
- country: categorical
- yearOfBirth: either identifier, quantitative, or categorical
- name: identifier
- year: either identifier or quantitative
- ageOfPaiting: quantitative
- price: quantitative
- material: categorical
- height: quantitative
- dominantColor: categorical
Question 2: Displaying and describing the data (15 points)
For the moment, we are going to focus on paintings by Chinese artists. You can create a subset of your data using the filter() verb as you learned in the DataCamp lab.
2a. Filtering your data
Using the filter() verb as described in the DataCamp lab, make a subset of your data that only includes art from Chinese artists. Show the code you used to make the subset using the #| echo: true code block option.
# Keep only the paintings whose country is recorded as "Chinese"
library(dplyr)
artdata.chinese <- artdata %>%
  filter(country == "Chinese")
2b. Investigating height
Using the Think-Show-Tell framework from the textbook (example on page 71), investigate the distribution of the height of the Chinese paintings.
Note: for this question and all other Think sections in the homework, you do not need to report the W’s of the data (you have already completed this in Q1).
Think
- I want to summarize the distribution of heights of Chinese paintings
- The data are height measurements of the paintings by Chinese artists contained in the dataset. The units are inches.
- Since we are reviewing the distribution of one quantitative variable, a histogram is the most appropriate display.
Show
| Statistic | Value |
|---|---|
| Count | 875 |
| Std Dev | 23.91 |
| Mean | 34.72 |
| Min | 4.92 |
| 25% | 14.96 |
| Median | 31.20 |
| 75% | 49.61 |
| Max | 192.13 |

Paintings flagged as potential height outliers:

| Artist | Name | Year | Height |
|---|---|---|---|
| Fang Lijun | 1996.1 (Triptych) | 1996 | 192.13 |
| Fang Lijun | 1999.3.1 | 1999 | 190.94 |
| Fang Lijun | 1996 | 1996 | 192.13 |
| Ma Baozhong | 19 December 1984 | 1997 | 146.00 |
| Ma Yanhong | Adulthood | 2006 | 190.00 |
The data appear to be right skewed. There appear to be a handful of suspicious outliers, but, on investigation, the height values for each of these works appear genuine, so there is no strong reason to exclude them.
Tell
After log transforming the data, the outliers no longer appear so serious. For the untransformed data, the mean (34.72 inches) is somewhat higher than the median (31.20 inches), likely due to the skew. The distribution is unimodal and right skewed. Based on the data, we can expect the height of a Chinese artwork to differ from the mean by roughly 23.91 inches on average (the standard deviation).
In more meaningful terms, the average artwork height is about the size of three sheets of paper stacked on top of each other, which seems about normal considering how paintings are usually produced. There is some variation: the standard deviation indicates that the height typically varies by about two sheets of paper's worth. The IQR is even larger (about 34.65 inches), indicating a wide range of heights in the data. The outliers are more modern pieces that may be some type of more experimental art.
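The effect of the log transformation described in the Tell step can be illustrated with a small sketch. The data below are simulated log-normal draws, not the homework dataset, and the names height_sim and skew_gap are hypothetical helpers introduced only for this illustration. The sketch compares the gap between mean and median before and after taking logs, then draws the two histograms side by side.

```r
# Simulated, right-skewed "heights" (NOT the homework data): log-normal draws.
set.seed(101)
height_sim <- rlnorm(875, meanlog = log(31), sdlog = 0.7)

# Gap between mean and median as a rough check for skew.
skew_gap <- function(x) mean(x) - median(x)

raw_gap <- skew_gap(height_sim)       # positive when the right tail pulls the mean up
log_gap <- skew_gap(log(height_sim))  # much smaller once the tail is compressed

# Side-by-side histograms on the raw and log scales.
par(mfrow = c(1, 2))
hist(height_sim, main = "Raw scale", xlab = "Height (simulated)")
hist(log(height_sim), main = "Log scale", xlab = "log(height)")
```

With the real data, the same comparison would presumably be run on the height column of artdata.chinese instead of height_sim.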
2c. Investigating width
Using the Think-Show-Tell framework from the textbook, investigate the distribution of the width of the Chinese paintings.
Think
- I want to summarize the distribution of the width of Chinese paintings
- The data are width measurements of the paintings by Chinese artists contained in the dataset. The units are inches.
- Since we are reviewing the distribution of one quantitative variable, a histogram is the most appropriate display.
Show
| Statistic | Value |
|---|---|
| Count | 875 |
| Std Dev | 29.37 |
| Mean | 28.62 |
| Min | 3.15 |
| 25% | 13.82 |
| Median | 20.08 |
| 75% | 31.89 |
| Max | 468.50 |

Paintings flagged as potential width outliers:

| Artist | Name | Year | Width |
|---|---|---|---|
| Dai Jin | Grand View Of Mountains And Rivers | NA | 468.5 |
| Ma Baozhong | 19 December 1984 | 1997 | 283.0 |
The data for width is also very right skewed with a handful of very distant outliers. One of the outliers was also an outlier in the height investigation (indicating a very large painting) and the other is a very long ink scroll. Both appear to be genuine measurements.
Tell
A log transformation makes the distribution nearly perfectly symmetric, suggesting that widths are approximately log-normally distributed. In untransformed terms, the distribution is right skewed and unimodal, with a mean of 28.62 inches. Based on the data, we can expect the width of a Chinese artwork to differ from the mean by roughly 29.37 inches on average (the standard deviation).
In this case, the average width is a little smaller than the average height, at about 2.5 sheets of paper wide. The standard deviation is nearly the same as the mean, in part due to the extreme outliers, but even the IQR is about 18 inches, indicating that there is significant variability in width, though less than in height. The large outlier suggests that some paintings are primarily horizontally oriented and others primarily vertically oriented, which may account for the large amount of variation.
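As a quick check on the spread comparisons made in 2b and 2c, the IQRs can be computed directly from the quartiles reported in the two summary tables. This is a minimal sketch that only re-uses the published summary values; the data frame name spread is a hypothetical label for this illustration.

```r
# Quartiles and standard deviations copied from the two summary tables (inches).
spread <- data.frame(
  variable = c("height", "width"),
  q1       = c(14.96, 13.82),
  q3       = c(49.61, 31.89),
  sd       = c(23.91, 29.37)
)

# IQR = Q3 - Q1; unlike the standard deviation, it is not affected by the extreme tails.
spread$iqr <- spread$q3 - spread$q1
print(spread)
```

The width has the larger standard deviation but the smaller IQR, which is consistent with a few very wide paintings inflating the standard deviation.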
2d. Thinking about your results
Consider the results of 2b. and 2c. together. What can we understand about Chinese art from viewing the distribution of these two variables?
Answers will vary here; a good-quality effort to interpret the investigation of this question is required.
Question 3: Relationships between categorical variables - American and Chinese artists and oil vs. ink. (15 points)
3a. Recoding your data
Using the mutate() verb and the case_when() verb combined with grepl(), create two new variables. The first is material.type and the second is us.china. The first variable should recode material to be either Oil, Ink, or Other, depending on whether the original values of material contained either the words oil or ink. The second variable should make a similar transformation to country where you recode the variable to be either American, Chinese, or Other. Show the code you used to make the new variables using the #| echo: true code block option.
Hint 1: you can see some examples of case_when() and grepl() here and here.
Hint 2: make sure to use the ignore.case=TRUE option in grepl()
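# Recode material and country into the two new three-level variables.
# case_when() uses the first matching condition, so a material that
# mentions both "oil" and "ink" is coded as "Oil".
artdata.uschina <- artdata %>%
  mutate(material.type = case_when(grepl("oil", material, ignore.case = TRUE) ~ "Oil",
                                   grepl("ink", material, ignore.case = TRUE) ~ "Ink",
                                   TRUE ~ "Other"),
         us.china = case_when(country == "American" ~ "American",
                              country == "Chinese" ~ "Chinese",
                              TRUE ~ "Other"))
3b. Investigating the categorical relationship between us.china and material.type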
Investigate the relationship between us.china and material.type.
Hint 3: you can see an example of some ways to display this information here.
Think
Show
| Material/Country | American | Chinese | Other | Total |
|---|---|---|---|---|
| Ink | 1819 | 585 | 1028 | 3432 |
| Oil | 1988 | 152 | 11068 | 13208 |
| Other | 6856 | 138 | 16436 | 23430 |
| Total | 10663 | 875 | 28532 | 40070 |

Column percentages (each country column sums to 100%):

| Material/Country | American | Chinese | Other | Total |
|---|---|---|---|---|
| Ink | 17% | 67% | 4% | 9% |
| Oil | 19% | 17% | 39% | 33% |
| Other | 64% | 16% | 58% | 58% |
| Total | 100% | 100% | 100% | 100% |
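The percentage table can be reproduced from the counts with base R's prop.table(). The sketch below re-enters the counts by hand rather than reading artdata, so the object names counts, col_pct, and total_pct are hypothetical and the approach is only one of several ways to build a conditional distribution.

```r
# Counts copied from the two-way table (rows: material, columns: country group).
counts <- matrix(
  c(1819, 585,  1028,
    1988, 152, 11068,
    6856, 138, 16436),
  nrow = 3, byrow = TRUE,
  dimnames = list(material = c("Ink", "Oil", "Other"),
                  country  = c("American", "Chinese", "Other"))
)

# margin = 2 divides each cell by its column total:
# the share of each material within a country group.
col_pct <- round(100 * prop.table(counts, margin = 2))

# Overall material shares, matching the "Total" column.
total_pct <- round(100 * rowSums(counts) / sum(counts))

print(cbind(col_pct, Total = total_pct))
```

Conditioning on country (columns) rather than on material (rows) makes groups of very different sizes, such as 875 Chinese versus 10663 American paintings, directly comparable.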
Tell
3c. Thinking about your results
Think carefully about why you have observed this result and provide some additional information about what this investigation means for understanding this dataset and art in general.
Answers will vary here; a good-quality effort to interpret the investigation of this question is required.
Question 4: Comparing groups (15 points)
4a. Recoding your data
Similar to the previous question, create a new variable called famous.countries that recodes country to be one of American, French, Italian, or Spanish. Mark art from all other countries as NA (the code that stands for missing or not available in R). Additionally, create a new variable called area that is a calculation of the area of the art (height times width). Show the code you used to make the new variables using the #| echo: true code block option.
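# Recode country to the four selected groups (all other countries become NA),
# compute the area, and then drop the rows outside the four groups.
# NA_character_ keeps every case_when() branch the same type.
artdata.famous.c <- artdata %>%
  mutate(famous.countries = case_when(country == "American" ~ "American",
                                      country == "French" ~ "French",
                                      country == "Italian" ~ "Italian",
                                      country == "Spanish" ~ "Spanish",
                                      TRUE ~ NA_character_)) %>%
  mutate(area = height * width) %>%
  filter(!is.na(famous.countries))
4b. Compare the groups of countries on the variable price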
Think
Show
[Figure: Comparing distribution of price across select countries]
Tell
4c. Compare the groups of countries on the variable area
Think
Show
Warning: Removed 1312 rows containing non-finite outside the scale range
(`stat_boxplot()`).
| artist | country | name | height | width | area |
|---|---|---|---|---|---|
| Andy Warhol | American | Franz Kafka | 4055.12 | 3228.35 | 13091346.7 |
| Andy Warhol | American | Electric chair | 355.12 | 479.92 | 170429.2 |
| Andy Warhol | American | Indian Head Nickel (F. & S. IIB.385) | 399.61 | 399.61 | 159688.2 |
| Andy Warhol | American | Electric chair (Feldman 82) | 355.12 | 479.92 | 170429.2 |
[Figure: Comparing distribution of area across select countries]
Tell
4d. Thinking about your results
Consider the results of 4b. and 4c. together. What can we learn about the differences in art between the countries? What do you think causes these differences or similarities? How would you confirm your guess as to the cause of the differences/similarities?
Answers will vary here; a good-quality effort to interpret the investigation of this question is required.
Question 5: Considering deviations (10 points)
5a. Selecting your data
Pick three years of paintings to investigate whether the brightness of paintings has changed over time. You are free to pick any three years but you should pick years that correspond to different periods in art history. State the three years and justify your selection.
Many options are possible here; in this example I will use 1888, 1920, and 1950.
5b. Finding the average
Calculate the average brightness for each of the three years. Show your code using the #| echo: true code block option.
# Mean brightness for each selected year, shown as a styled table
# (kbl() and kable_styling() come from the kableExtra package)
artdata %>%
  filter(year %in% c(1888, 1920, 1950)) %>%
  group_by(year) %>%
  summarize(mean.brightness = mean(brightness, na.rm = TRUE)) %>%
  mutate(Year = as.character(year),
         Brightness = round(mean.brightness, 2)) %>%
  select(Year, Brightness) %>%
  kbl() %>%
  kable_styling()

| Year | Brightness |
|---|---|
| 1888 | 128.00 |
| 1920 | 148.98 |
| 1950 | 151.59 |
5c. Normalizing the data
Find how many \(z\) units each yearly average is from the overall mean of brightness, and interpret your results.
Think
Show
| Year | Mean | Z Score |
|---|---|---|
| 1888 | 128.00 | -0.37 |
| 1920 | 148.98 | 0.04 |
| 1950 | 151.59 | 0.09 |
| Mean overall | 146.83 | |
| SD overall | 51.09 | |
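The z scores in the table follow from z = (yearly mean - overall mean) / overall SD. The sketch below re-enters the reported values by hand, so the object names are hypothetical. It assumes, as the table suggests, that the overall standard deviation of individual paintings is used as the yardstick, not a standard error of the yearly mean.

```r
# Values copied from the table.
year_means   <- c("1888" = 128.00, "1920" = 148.98, "1950" = 151.59)
overall_mean <- 146.83
overall_sd   <- 51.09

# z = (yearly mean - overall mean) / overall SD
z_scores <- round((year_means - overall_mean) / overall_sd, 2)
print(z_scores)
```

All three yearly means lie well within half a standard deviation of the overall mean, with 1888 the furthest away and the only one below it.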
Tell
5d. Thinking about your results
What are some of the implications of your findings with regard to the motivation of this question? What are some of the limitations of this analysis? What other kind of analysis would you like to do to answer this question?
Answers will vary here; a good-quality effort to interpret the investigation of this question is required.
Question 6: Your own investigation (15 points)
6a. Selecting your own question
Similar to the previous questions, think of your own question that you would like to ask of the data. Use the Think-Show-Tell procedure to conduct your investigation. Think deeply about what your result means.
Think
Show
Tell
Answers will vary here; a good-quality effort to interpret the investigation of this question is required.
6b. In summary
Sum up everything that you have learned in this investigation. Do not simply repeat/rephrase your previous results but try to say something larger that synthesizes the results together to draw a more meaningful general conclusion.
Students need to think deeply about what information this dataset provides to earn full points.